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Abstract 

In a recent article, Behrens and Vingron (JCB 17, 12, 2010) compute waiting 
times for k-mers to appear during DNA evolution under the assumption that the 
considered /c-mers do not occur in the initial DNA sequence, an issue arising when 
studying the evolution of regulatory DNA sequences with regard to transcription 
factor (TF) binding site emergence. The mathematical analysis underlying their 
computation assumes that occurrences of words under interest do not overlap. 
We relax here this assumption by use of an automata approach. In an alphabet 
of size 4 like the DNA alphabet, most words have no or a low autocorrelation; 
therefore, globally, our results confirm those of Behrens and Vingron. The out- 
come is quite different when considering highly autocorrelated fe-mers; in this 
case, the autocorrelation pushes down the probability of occurrence of these k- 
mers at generation 1 and, consequently, increases the waiting time for apparition 
of these fe-mers up to 40%. An analysis of existing TF binding sites unveils a 
significant proportion of fc-mers exhibiting autocorrelation. Thus, our computa- 
tions based on automata greatly improve the accuracy of predicting waiting times 
for the emergence of TF binding sites to appear during DNA evolution. We do 
the computation in the Bernoulli or MO model; computations in the Ml model, 
a Markov model of order 1, are more costly in terms of time and memory but 
should produce similar results. While Behrens and Vingron considered specifi- 
cally promoters of length 1000, we extend the results to promoters of any size; 
we exhibit the property that the probability that a k-mei occurs at generation 
time 1 while being absent at time behaves linearly with respect to the length of 
the promoter, which induces a hyperbolic behaviour of the waiting time of any 
fc-mer with respect to the length of the promoter. 
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1 Introduction 



The expression of genes is subject to strong regulation. The key concept of transcrip- 
tional gene regulation is the binding of proteins, so called transcription factors (TFs), 
to TF binding sites. These TF binding sites are typically short stretches of DNA, many 
of which are only around 5-8bp long (Wray et al. (2003)). Usually, these TF binding 
sites are located in a region around lOOObp upstream of the gene they regulate, the so 
called promoter. Thus, the occurrence of particular /c-mers in these promoter regions 
has a high impact on modulating transcription. There have been several experimen- 
tal studies employing ChlP-chip or ChlP-seq technology showing that promoters are 
rapidly evolving regions that change over short evolutionary time scales (Odom et al. 
(2007), Schmidt et al. (2010), Kunarso et al. (2010)). In a recent review, Dowell (2010) 
summarizes all these experimental findings and concludes that most TF binding events 
are species-specific and that gene regulation is a highly dynamic evolutionary process. 
Many of these changes in TF binding, if not necessarily all, can be explained by gains 
and losses of TF binding sites. 

Several theoretical studies have tried to give a probabilistic explanation for the speed 
of changes in transcriptional gene regulation (e.g. Stone and Wray (2001), Durrett and 
Schmidt (2007)). Behrens and Vingron (2010) infer how long one has to wait until a 
given TF binding site emerges at random in a promoter sequence. Using two different 
probabilistic models (a Bernoulli model denoted by M0 and a neighbor dependent 
model Ml) and estimating evolutionary substitution rates based on multiple species 
promoter alignments for the three species Homo sapiens, Pan troglodytes and Macaca 
mulatta, they compute the expected waiting time for every /c-mer, k ranging from 5 to 
10, until it appears in a human promoter. They conclude that the waiting time for a 
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TF binding site is highly determined by its composition and that indeed TF binding 
sites can appear rapidly, i.e. in a time span below the speciation time of human and 
chimp. 

However, in their approach, Behrens and Vingron (2010) rely on the assumption 
that if a k-mer of interest appears more than once in a promoter sequence, it does not 
overlap with itself. This particularly affects the waiting times for highly autocorrelated 
words like e.g. AAAAA or CTCTCTCTCT. Using automata, we can relax this assumption 
and, thus, more accurately compute the expected waiting times until appearance for 
every /c-mer, k ranging from 5 to 10, in a promoter of length lOOObp. This automa- 
ton approach can be applied both for models MO and Ml. However, for the ease of 
exposition, in this article we will focus on the Bernoulli model MO. 

This article is structured as follows. In Section 2, we describe model MO, state 
results from Behrens and Vingron (2010) that we rely on and recall how Behrens and 
Vingron (2010) have estimated model M0 parameters based on human, chimp and 
macaque promoter alignments. In Section 3, we present our new approach of comput- 
ing waiting times using automata theory; we provide in this section a web-pointer to 
the program used to perform these computations. Section 4 compares the results of 
computing waiting times for fc-mers to appear in a promoter of length 1 kb according to 
Behrens and Vingron (2010) and to our new automaton approach. For both computa- 
tions, we employ the same model parameters estimations that have been already used 
in Behrens and Vingron (2010); we also explain in this section the biological impact 
of our findings and show that autocorrelation matters in the context of TF binding 
site emergence. Section 5 exhibits the first order linear behaviour of the probability 
of evolution to a k-mer from generation time to time 1 for specific examples; the 
observed phenomena is however general, as proved in Nicodeme (2011). We provide 
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in this section a web-pointer to a database containing the waiting times of all k-mers 
for k from 5 to 10 and for promoter lengths n = 1000 and n = 2000. Section 6 will 
conclude the article with some summarizing remarks. 
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2 Model MO and expected waiting times 

Throughout the article, we assume that promoter sequences evolve according to model 
MO which has been described by Behrens and Vingron (2010). 

Model MO. Given an alphabet A = {A,C,G,T}, let S(0) = (Si(0), . . . , S n {0)) denote 
the initial promoter sequence of length n taking values in this alphabet. We assume that 
the letters in S(0) are independent and identically distributed with u(x) := Pr(5i(0) = 
x). Let the time evolution (S(t)) t > of the promoter sequence be governed by the 4x4 
infinitesimal rate matrix Q = {r a ^) a ^ & j,. According to the general reverse complement 
symmetric substitution model, we assume that the nucleotides evolve independently 
from each other and that ta s t = r T ,A, r c ,G = r GtC , r A ,c = r T ,c, r c ,A = r GjT , r A ,G = r T ,c 
and tc,a — r c,T (see also Duret and Arndt (2008)). Thus, there are 6 free parameters. 
The matrix P(t) = (p a ,p(t))a,peA containing the transitions probabilities of a evolving 
into (3 in finite time t > 0, (a, (3 G A), can be computed by F(t) = e t<Q ; see Karlin and 
Taylor (1975), p. 150-152. 

The expected waiting time. Given a binding site 

b = (h,..., b k ) where b x , . . . , b k G A, (1) 

the aim is to determine the expected waiting time until b emerges in a promoter se- 
quence of length n provided that it does not appear in the initial promoter sequence 
S(0) . More precisely, let 

T n = inf{t G N : 3i G {1, . . . ,n — k + l} such that (S^t), . . . , S i+k ^(t)) = (b u . . . , b k )}. 

(2) 
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Then, given that Pr(6 occurs in S(0)) = 0, T n has approximately a geometric distribu- 
tion with parameter 

p n = Pr(6 occurs in generation 1 | b does not occur in generation 0) (3) 
= Pr(beS(l) | b<?S(0)) 

as shown by Behrens and Vingron (2010). In particular, one has 

E(T B ) « 1. (4) 

Estimating the parameters of model MO. For our analyses, we used the same 
parameter estimations as Behrens and Vingron (2010). The estimations for u(a), 
a £ A, have been obtained by determining the relative frequencies of A, C, G and T 
in human promoter regions downloaded from UCSC. The substitution rates r a ^ have 
been estimated using multiple alignments from UCSC of chimp and macaque DNA 
sequences to human promoters and by employing the Maximum likelihood based tool 
developed by Arndt and Hwa (2005). Afterwards, the transition probabilities p a ,p(t) 
for e.g. t — 1 generation can be easily computed by the matrix exponential P(t) = e^. 
Assuming a speciation time between human and chimp of 4 Million of years and a 
generation time of y = 20 years, Behrens and Vingron (2010) obtain estimations for 



Pa,piX) = Pa,p0- generation) for all a, (3 G A. Their results are summarized in Table 1. Table 1 
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3 Automaton approach 

The aim of this section is to provide a new procedure to compute the expected waiting 
time E(T n ) until a TF binding site b of length k emerges in a promoter sequence of 
length n by using Equation (4), i.e. E(T n ) ps Behrens and Vingron (2010) approxi- 
mated p n = Pr(b occurs in generation l\b does not occur in generation 0) by applying 
the inclusion-exclusion principle. However, in order to make the computations feasible, 
they had to assume that b cannot appear self-overlapping which especially adulterates 
the actual waiting times for autocorrelated words. Automata theory provides a natural 
and compact framework to handle autocorrelations easily; in this section we present 
how to use basic automata algorithms in order to compute the probability p n without 
resorting to the assumption that b occurs non-overlapping. 

Definitions. In this section, only definitions that will be used in the sequel are 
recalled; more information about automata and regular languages can be found in 
Hopcroft et al. (2001). Given a finite alphabet A, a deterministic and complete au- 
tomaton on A is a tuple (Q, 5, qo,F), where Q is a finite set of states, 5 is a mapping 
from Q x A to Q, qo G Q is the initial state and F C Q is the set of final states. Let e de- 
note the empty word. The mapping 5 can be extended inductively to Q x A* by setting 
S(q, e) = q for all q G Q and, for all q £ Q, u £ A* and a £ A, S(q, ua) = 5(5(q, u),a). 
A word u G A* is recognized by the automaton when 5(qo,u) G F. The language 
recognized by the automaton is the set of words that are recognized. 

Since all automata considered in the sequel are deterministic and complete, we will 
call them "automata" for short. Automata are well represented as labelled directed 
graphs, where the states are the vertices, and where there is an edge between p and q 
labelled by a letter a G A if and only if 5(p, a) = q; such an edge is called a transition. 
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The initial state has an incoming arrow, and final states are denoted by a double circle. 
See Figure 1 for an example of such a graphical representation. A word u is recognized 
when starting at the initial state and reading u from left to right, letter by letter, and 
following the corresponding transition, one ends in a final state. 

Rewording the problem. Consider the alphabet B = Ax A. Letters of B are pairs 
(a, (3) of letters of A, which are represented vertically by (^). A word u of length n 
on B is also seen as a pair of words of length n over A, and represented vertically: if 
u = (ai, /?i)(a 2 , P2) ■ ■ ■ ( a n,/3n)i we shall write u = («*'"'«")■ For any word u = (^) of 
B*, the projections ttq and 7Ti are defined by tto(u) = v and tti(u) = w. 

For the problems considered in this article, we have A = {A,C,G,T}, and a word 
u = r) of length n over B represents the sequence that was initially equal to v and 
that has evolved into w at time 1; that is, S(0) = vto(m) and S'(l) = iri(u). The 
main problem can be reworded using rational expressions: for a given b = bi ■ ■ -bk, 
the fact that b appear in 5*(1) but not in S(0) is exactly the condition ni(u) G A*bA* 
and 7t (m) ^ A*bA* . We denote by the set of such words and remark that is a 
rational language. 

Construction of the automaton. The smallest automaton M.^ that recognizes 
the language A*bA* can be built using the classical Knuth- Morris- Pratt construction 
(see Crochemore and Rytter (1994), chapter 7). This requires for any k-mer O(k) time 
and space, and the produced automaton M. h = ({0, . . . , k}, 5b, 0, {k}) has exactly k + 1 
states. 

The language ^4* \ A*bA* is the complement of the previous one, and is therefore 
recognized by the automaton M. h = ({0, . . . , k}, 5b, 0, {0, . . . , k — 1}), which has the 
same underlying graph as A4b and whose set of final states is the complement of M.b§ 
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one. For the examples given in this section, we use a smaller alphabet A = {A,C} 
and the k-mer is always b = ACC, (hence k = 3). The two automata are depicted in 



Figure 1. To fully describe the language we use the classical product automaton Figure 1 
construction, tuned to fit our needs. Define the automaton Mb = {Q, 8, qo, F) as follows: 

• The set of states is Q — {0, . . . , k} x {0, . . . , k}. The states of Mb are therefore 
pairs (p, q), where intuitively p lies in M. b and q lies in 

• The initial state is q = (0, 0). 

• The transition mapping 5 is defined for every (p, q) G Q and every (a, 0) G B by 



<J((p, <?), (a, /?)) = a), <5b(<?, /?))• The idea is to read 7t (m) in .Mb on the first 
coordinate, and Wi(u) in M. b on the second coordinate. 

• A state (p, q) is final if and only if both p and q are final in their respective 
automata, that is, F = {0, . . . , k — 1} x {&;}. 

The proof of the following lemma follows directly from the construction of Mb' 

Lemma 3.1 The automaton M b recognizes the language Cb- 

Looking closer at the automaton one can make the following observations: while reading 
a word u of B* in Mb, if one reaches a state of the form (p, k) at some point, for some 
p G {0, . . . , k}, then all the remaining states on the path labelled by u are also of the 
form (q, k), for some q G {0, . . . , k}. This is because 5b(k, a) = k for every a G A. 
Since this state is not final, this means that whenever the second coordinate is k at 
some point, the word is not recognized because iro(u) contains b. We can therefore 
simplify the automaton Mb by merging all the states of the form (p, k) into a single 
state, which we name sink. Let Ml = (Q', 5', q' , F') denote this new automaton, which 
has k 2 + k + 1 states. Lemma 3.2 below states that all the information we need is 
contained in Ml- See an example of this automaton in Figure 2. Figure 2 
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Lemma 3.2 Let u be a word in B* , and let q u be the state reached after reading u in 
Ml from its initial state. The words u can be classified as follows: 

• if q u G F' then n (u) does not contains b but Wi(u) does (this is a success in our 
settings ); 

• if q u is the sink state then vto(m) contains b (this is contradictory in our settings); 

• if Qu ^ F' and q u is not the sink state, then neither n (u) nor ix\{u) contains b 
(this is a failure in our settings). 

From automata to probabilities. The automaton Ml is readily transformed into 
a Markov chain, by changing the label of any transition q A q', where a — G B, 
into the probability v(a) x ^^(l). If there are several transitions from q to q', the 
edge is labelled by the sum of the associated probabilities. Let Cb denote this Markov 
chain. The random variable Q n associated to the state reached after reading a random 
word of size n under the MO model is formally defined by: 

Vg G Q', Pr (Q n = q)= ^ u(v) x p v ^ w (l). (5) 

Then, if Fj, is the transition matrix of Cb and if V q is the probability vector with 1 on 
position q G Q' and elsewhere, the random state Q n reached from the initial state 
after n steps verifies 

Vg G Q', Pr {Q n = q) = VS xP?x V q . (6) 
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From this and by Lemma 3.2 we can compute all the needed probabilities : 

Pr (5(1) 6 AW I S® f Aw) = Pr(g(1) ^ A'bA') (7) 

Pr(Q n eF') 



Pr(Q„ = sink) 

V} Q x P™ x V sink 



(9) 



We therefore get our main result. 



Theorem 3.3 Let b G A k and HI = (Q f , 5', q' , F') be its automaton, with associated 
matrix P&. The probability p n that a sequence of length n contains b at time 1 given 
that it does not contains b at time is exactly 

p„ = Pr (Sd) £ ,4*M* | S(0) * «) = - ^ — 



Applying Theorem 3.3 and Equation (4), we obtain that the expected waiting time 
E(T n ) w -p until a binding site 6 of length k appears in a promoter of length n can be 
approximated by 

i i K5 x p™ x v; ink 

E T B «- = — , v = * , v . (10) 

P« Pr (S(l) e A*bA* | 5(0) i A*bA*\ xPJx ( £ geF , VU 

Complexity. The automaton J\f b ', and the associated Markov chain can be built 
in time and space 0(\A\ 2 k 2 ). Once done, the whole calculation reduces to the com- 
putation of the row vector V£ x P£, which can be done iteratively using the simple 
relation 



V} q x P* +1 = (V q \ x P*) xP b . 



row vector 
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Hence this consists of n products of a vector by a matrix. Moreover, this matrix is a 
square matrix of dimension k 2 + k + l, which is sparse since it has exactly (/c 2 + /c+l)|*4.| 2 
non-zero values. Therefore, the probability of Theorem 3.3 can be computed in time 
0(n x k 2 x \A\ 2 ), using 0(\A\ 2 k 2 ) space. 

Web aCCeSS tO the COde. URL http://www.lix.polytechnique.fr/Labo/Pierre.Nicodeme/BNN/kmer.c 

provides the C code used in this section. 
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4 Biological results 

Applying Equation (10) for obtaining the automaton results and using Theorem 1 from 
Behrens and Vingron (2010), we computed the expected waiting time E(T 10 oo) of all 
fc-mers in the MO model for k from 5 to 10 to appear in a promoter sequence of length 
1000 bp. The parameters of model M0 have been estimated as described in Section 2 
and are depicted in Table 1. 

Figure 3 provides an overall comparison of the waiting time computed by automata 
with respect to the previous computations of Behrens and Vingron (2010) for k = 5 



and k = 10. As can be observed in this scatterplot, the computed waiting times Figure 3 
based on the automaton approach globally confirm the results of Behrens and Vingron 
(2010). However, there are some outliers exhibiting longer waiting times than pre- 
dicted by Behrens and Vingron (2010). The four most extreme outliers that deviate 
from the bisecting line correspond to AAAAA, TTTTT, CCCCC, GGGGG and to AAAAAAAAAA, 
CCCCCCCCCC, GGGGGGGGGG, TTTTTTTTTT respectively. Other outliers are fc-mers like 
e.g. CGCGC, TCTCT and CGCGCGCGCG, TCTCTCTCTC. Tables 2, 3 and 4 show all 5-, 7- and 
10-mers for which > 1-05 where E BV (T 100 o) denotes the expected waiting 

time according to Behrens and Vingron (2010) and E B nn(^iooo) according to our au- 
tomaton approach, i.e. /c-mers with significantly longer waiting times than predicted 
by Behrens and Vingron (2010). Table 2 

We use in the following the million of generations (in short Mgen) as unit of time, 
where a generation is 20 years. The discrepancy between the two procedures can attain 
up to around 40%, e.g. CCCCC has a discrepancy of 44% with E BNN (T 100 o) — 9.105 Mgen 
and Ebv(^iooo) = 6.304 Mgen, CCCCCCC a discrepancy of 43% with E BN n(^iooo) = 
93.457 Mgen and E BV (Ti 00 o) = 65.518 Mgen, and CCCCCCCCCC has a discrepancy of 



Table 3 



Table 4 
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41% with Ebnn(Tiooo) = 3577.003 Mgen and E BV (Ti oo) = 2545.561 Mgen. Strikingly, 
most of the /c-mers with significant discrepancy feature a high autocorrelation, i.e. 
they can appear overlapping in so called clumps. For example, the 5-mer CCCCC could 
appear twice in the clump CCCCCC (at positions 1 and 2), CGCGC could appear three 
times in the clump CGCGCGCGC (at positions 1, 3 and 5). In order to distinguish between 
different levels of autocorrelation of fc-mers, let 

V{b) :— {p G {1, . . . , k — 1} : 6j = b i+p for all i — 1, . . . , k — p} 

denote the set of periods of a k-mer b = (&!,...,&*.). A k-mer b is called non-periodic or 
non-autocorrelated if and only if V{b) = 0. Furthermore, for a periodic k-mer b let po{b) 
denote its minimal period. For example, p (CCCCC) = 1, p (CGCGC) = 2, p (CGACG) = 3 
and po(CGATC) = 4. We then call a word p-periodic if and only if its minimal period is 
p. As can be observed in Tables 2, 3 and 4, half of the 5-mers, two-thirds of the 7-mers 
and all of the 10-mers with Ebnn(Jiooo) > 1.05 are either 1- or 2-periodic, i.e. show a 

&BV(1 1000 j 

high degree of autocorrelation. 

Behrens and Vingron (2010) already investigated the speed of TF binding site 
emergence and its biological implications for the evolution of transcriptional regulation 
in detail and we do not want to elaborate on this again. However, in line with Behrens 
and Vingron (2010), we want to emphasize that the speed of TF binding site emergence 
is primarily influenced by its nucleotide composition. The goal in the following will 
be to investigate the impact of autocorrelation regarding TF binding sites. More 
precisely, we want to answer the question: Do existing TF binding sites show significant 
autocorrelation or can this aspect be neglected when studying the speed of TF binding 
site emergence? 

To investigate this, starting from the JASPAR CORE database for vertebrates 
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Version 4 (Portales-Casamar et al. (2010)), we extracted all the human TF binding 
sites of length k, 5 < k < 10, ending up with a set of 37 position count matrices 
(PCMs) for the 37 different TFs in analogy to Behrens and Vingron (2010). In order 
to make these PCMS accessible for our framework based on /c-mers, we converted a 
PCM into a set of /c-mers by setting a threshold of 0.95 of the maximal PCM score 
and extracted all fc-mers with a score above this threshold. For example, the PCM 
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of the TF SP1 is then translated into the following set of 10-mers: {CCCCACCCCC, 
CCCCCCCCCC, CCCCGCCCCC, CCCCTCCCCC}. Applying this procedure, in total we obtain 
372 different JASPAR fc-mers, 5 < k < 10, for the 37 different human TFs. We then 
screened all JASPAR fc-mers for 1-periodicity, 2-periodicity,..., {k — l)-periodicity. To 
evaluate the degree of autocorrelation of a given JASPAR TF given by its set of k- 
mers, we then computed the proportion of 1-periodic, 2-periodic,..., (k — l)-periodic 
and of non-periodic fc-mers in this set. The results are depicted in Figure 4. As 
can be seen, some TFs like SP1, FOXL1, YY1, GATA3, GATA2 and ETS1 exhibit 
a high autocorrelation while 14 of the 37 TFs show no autocorrelation at all (USF1, 
SPI1,..., API). In order to test whether autocorrelated fc-mers are enriched among 
JASPAR TF binding sites, as a background we screened all possible /c-mers, i.e. all 
b = (&!,..., bk) G A k , A = {A,C,G,T}, k ranging from 5 to 10, for autocorrelation 
in the same way as JASPAR fc-mers. The resulting proportions of periodic and non- 
periodic words of this background are also depicted in Figure 4. In total, among the 
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JASPAR fc-mers, there are 168 autocorrelated words (i.e. words that are p-periodic 
for one p G {1, . . . , k — 1}) and 204 non-autocorrelated words. The background set 
contains 435,828 autocorrelated and 961,932 non-autocorrelated /c-mers. Performing 
Fisher's Exact Test for Count Data with the alternative "greater", we obtain a p- value 
of 1.119e-08. We can thus conclude that autocorrelated words are significantly enriched 
among JASPAR fc-mers. Consequently, existing TF binding sites indeed feature a 
significant proportion of autocorrelation. 
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5 Linear behaviour of ^ n 

Figure 5 

In Section 3 we considered by automata a parallel computation on two sequences, 5(0) 
and S(l). 

It is possible to do a relevant mathematical analysis with the random sequence 5(0) 
only. The corresponding computations have however a much higher complexity than 
the automaton approach. This analysis is defined on counting in a random sequence 
5(0) the number of putative-hit positions where, given a fc-mer b, a putative-hit position 
is any position of 5(0) that can lead by mutation to an occurrence of b is 5(1), assuming 
that a single mutation has occurred. 

For any k-mer b Nicodeme (2011) provides a combinatorial construction using 
clumps (see Bassino et al. (2008)) that (i) considers all the sequences that avoid the 
fc-mer b, and (ii) counts all the putative-hit position in these sequences. 

In the following, let H n denote the number of putative-hit positions in a sequence 
5(0) randomly chosen within the set of sequences of length n that do not contain the 
fc-mer b, where the letters are drawn with respect to the distribution v and where 
we put a probability mass 1 to the set '. As a consequence of singularity analysis of 
rational functions, Nicodeme (2011) proves that 

E(H n ) = Cl xn + c 2 + 0(A n ) (A < 1). (11) 

It is clear that, using the asymptotic Landau's G notation, we do not have 

p n = Q(E(H n )), 

^This is done by unconditioning with respect to the fact that b does not occur in S(0), i.e by 
dividing the resulting expressions by Pr(S(0) ^ A*bA*); see Equation (7). 
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since, for n large enough, this would imply that p n > 1. However, for 

max (p ai g(l)) <C 1 and n <C l/max (p a 

the probability that two or more putative-hit positions simultaneously mutate to pro- 
vide the k-mer b in sequence S(l) is an event of second order small probability. With 
these conditions, we have 

Pn « pb,u, P x E(if n ) = p 6il/(P x (ci xii + c 2 ) + 0(A n ), (12) 

where p&^p is a constant of the order of magnitude of the constants p a> p(l) with a ^ 
(3, its value depending upon these constants, the distribution v and the correlation 
structure of the /c-mer b. See Figure 5 for examples. 

Available data. URL http: / /www. lix.polytechnique . f r/Labo/Pierre . Nicodeme/BNN/Waitf orkmers . tar . gz 
provides access to the values of the expected waiting time E(T„) and the probability for n = 1000 
and n = 2000 for all fc-mcrs with k from 5 to 10. It is therefore possible to compute p„ and E(T„) for 
all these fc-mers for all n from these data. It took 10 hours to compute the data. 
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6 Conclusion 



Using automata theory, we have developed a new procedure to compute the waiting 
time until a given TF binding site emerges at random in a human promoter sequence. 
In contrast to Behrens and Vingron (2010), we do not have to rely on any assumptions 
regarding the overlap structure of the TF binding site of interest. Thus, our computa- 
tions are more accurate. Assuming model MO, whose parameters have been estimated 
in the same way as in Behrens and Vingron (2010), applying our automaton approach 
to all k-mers, k ranging from 5 to 10, and comparing the resulting expected waiting 
times to those obtained by Behrens and Vingron (2010), we particularly observe that 
highly autocorrelated words like CCCCC or AAAGG actually tend to emerge slower than 
predicted by Behrens and Vingron (2010). This slowdown can attain up to 40%, e.g. 
according to Behrens and Vingron (2010), CCCCC is predicted to be created in a human 
promoter of length 1 kb in around 6.304 Mgen while our more accurate method pre- 
dicts it be generated in around 9.105 Mgen. We have shown that existing TF binding 
sites (from the database JASPAR; Portales-Casamar et al. (2010)) feature a signifi- 
cant proportion of autocorrelation. Therefore the assumption of Behrens and Vingron 
(2010) that TF binding sites do not appear self-overlapping when computing waiting 
times is problematic. The new automaton approach now incorporates the possibility of 
TF binding sites appearing self-overlapping into the model. Hence, the automaton ap- 
proach highly improves the accuracy of the estimations for waiting times. We observed 
a linear behaviour with respect to the length of the promoters for the probability of 
finding a k-mer at generation 1 that is not present at generation 0. This implies a 
highly flexible and efficient approach for computing this probability for any promoter 
length, and in particular for lengths of highest interest, i.e. between 300 and 3000 bp. 



20 



This also induces a hyperbolic behaviour for the waiting time. 
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Figure 2: The automaton M' ACC . For the automaton to be readable, we use the 
notations A = Cfj , A = (^) , C — (^) and C = (^) . When the label of a transition is 
not given, it is by default set to the letter at the bottom of its ending state. 




Figure 3: Overall comparisons of waiting times of Behrens and Vingron 
(2010) (BV) versus the automata method (BNN) for 5- and 10-mers. 
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Figure 4: Barplot of the degree of autocorrelation of JASPAR TF binding 
sites. For every of the 37 JASPAR TFs each given by a set of fc-mers, the proportion 
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k — 1. Additionally, the same proportions were computed for all possible /c-mers, k 
ranging from 5 to 10 ("Background"). 
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Table 1: Parameter estimations. Numbers taken from Behrens and Vingron (2010), 
Supplementary Material S2. 
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Table 2: Expected waiting times (generations) for 5-mers in model MO with 
^vcShx^ > 1-05- Ebv(Tiooo) denotes the expected waiting time according to Behrens 
and Vingron (2010) (BV) and Ebnn(^iooo) according to our automaton approach 
(BNN). Ranks refer to 5-mers sorted by their waiting time of appearance according to 
the two different procedures BV and BNN; rank 1 is assigned to the fastest evolving 
5-mer, rank 1024 (=4 5 ) to the slowest emerging 5-mer. 
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Table 3: Expected waiting times (generations) for 7-mers in model MO with 
E i^Z°) > L05 - E bv(T 10 oo) denotes the expected waiting time according to Behrens 
and Vingron (2010) (BV) and E B nn(^iooo) according to our automaton approach 
(BNN). Ranks refer to 7-mers sorted by their waiting time of appearance according to 
the two different procedures BV and BNN; rank 1 is assigned to the fastest evolving 
7-mer, rank 16384 (=4 7 ) to the slowest emerging 7-mer. 
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Table 4: Expected waiting times (generations) for 10-mers in model MO 
with ^r^Tp^r > 1.05. Ebv(Tiooo) denotes the expected waiting time according 
to Behrens and Vingron (2010) (BV) and E B nn(^iooo) according to our automaton 
approach (BNN). Ranks refer to 10-mers sorted by their waiting time of appearance 
according to the two different procedures BV and BNN; rank 1 is assigned to the fastest 
evolving 10-mer, rank 1048576 (=4 10 ) to the slowest emerging 10-mer. 



